SBARS: fast creation of dotplots for DNA sequences on different scales using GA-, GC-content

Authors

  • Maxim I. Pyatkov
  • Anton N. Pankratov
Abstract

SUMMARY Structural analysis of long DNA fragments, including chromosomes and whole genomes, is one of the main challenges in modern bioinformatics. Here, we propose an original approach based on spectral methods and its implementation, called SBARS (Spectral-Based Approach for Repeats Search). The main idea of our approach is that repeated DNA structures are recognized not within the nucleotide sequence directly but within a function derived from this sequence. This allows us to investigate nucleotide sequences on different scales and to decrease the time complexity of dotplot creation down to [Formula: see text]. AVAILABILITY AND IMPLEMENTATION Pre-compiled versions for Windows and Linux and documentation are available at http://mpyatkov.github.com/sbars/.
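The idea of searching for repeats in a function derived from the sequence, rather than in the nucleotides themselves, can be illustrated with a small sketch. This is not the SBARS implementation: the abstract does not specify the spectral basis or the comparison rule, so the code below assumes a sliding-window GC-content profile, a few low-order DCT-II coefficients per block as the "spectrum", and a Euclidean distance threshold. All function names and the parameters `window`, `block`, `k` and `eps` are hypothetical.

```python
import math


def gc_profile(seq, window):
    """GC fraction of every sliding window of length `window` (step 1)."""
    seq = seq.upper()
    prefix = [0]  # prefix[i] = number of G/C bases in seq[:i]
    for base in seq:
        prefix.append(prefix[-1] + (base in "GC"))
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(seq) - window + 1)]


def dct_coeffs(values, k):
    """First k DCT-II coefficients of `values` (assumed stand-in basis)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * j / n)
                for i, v in enumerate(values)) / n
            for j in range(k)]


def spectral_dotplot(seq, window=8, block=16, k=4, eps=0.02):
    """Boolean matrix: cell [a][b] is True when profile blocks a and b
    have low-order spectra closer than `eps` (hypothetical threshold)."""
    profile = gc_profile(seq, window)
    blocks = [profile[i:i + block]
              for i in range(0, len(profile) - block + 1, block)]
    spectra = [dct_coeffs(b, k) for b in blocks]
    return [[math.dist(a, b) <= eps for b in spectra] for a in spectra]


if __name__ == "__main__":
    unit = "ACGTGGCCAATTGCGCATATATGCGGGCAAAT"  # made-up 32-base unit
    for row in spectral_dotplot(unit * 2):
        print("".join("#" if hit else "." for hit in row))
```

Because each block is reduced to `k` numbers, comparing two blocks costs the same regardless of block length; choosing a larger `window` and `block` gives a coarser-scale dotplot of the same sequence. A GA-content profile could be obtained by testing `base in "GA"` instead.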

Related articles

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Repetition and Language Models and Comparable Corpora

I will discuss a couple of non-standard features that I believe could be useful for working with comparable corpora. Dotplots have been used in biology to find interesting DNA sequences. Biology is interested in ordered matches, which show up as (possibly broken) diagonals in dotplots. Information Retrieval is more interested in unordered matches (e.g., cosine similarity), which show up as squa...

Genomics dataset of unidentified disclosed isolates

Analysis of DNA sequences is necessary for higher hierarchical classification of the organisms. It gives clues about the characteristics of organisms and their taxonomic position. This dataset is chosen to find complexities in the unidentified DNA in the disclosed patents. A total of 17 unidentified DNA sequences were thoroughly analyzed. The quick response codes were generated. AT/GC content o...

Are isochore sequences homogeneous?

Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the null hypothesis (the mean value of a group of GC contents remains the same along the sequence) ma...

Determination of GC content of Thermotoga maritima, Thermotoga neapolitana and Thermotoga thermarum strains: A GC dataset for higher level hierarchical classification

A total of 16 strains of hyperthermophilic Thermotoga complete genome sequences viz. Thermotoga maritima (AE000512, CP004077, CP007013, CP011107, NC_000853, NC_021214, NC_023151, NZ_CP011107, CP011108, NZ_CP011108, CP010967 & NZ_CP010967), Thermotoga neapolitana (CP000916, & NC_011978) and Thermotoga thermarum (CP002351 & NC_015707) complete genome sequences were retrieved from NCBI BioSample d...

Journal title:
  • Bioinformatics

Volume 30, Issue 12

Pages -

Publication date: 2014